Have you ever wondered how different the menus of fast food companies really are, from McDonald’s and Burger King to KFC? Are they truly distinct from each other, or do they share similar recipes? Well, we are going to find out! In this project, we will build several machine learning models to classify fast food companies based on the nutrition facts of their menu items. Techniques such as Principal Component Analysis (PCA) are adopted to help the models make better predictions, and various hyperparameter values are tuned to find the best model.
Before diving into the project, let’s make sure we understand nutrition facts. These labels are everywhere nowadays, but not too many people understand what they really mean. Let’s break it down a bit.
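In short, a standard nutrition facts label reports, per serving: Calories (total energy), Total Fat in grams with Saturated Fat and Trans Fat broken out, Cholesterol and Sodium in milligrams, Total Carbohydrates in grams with Dietary Fiber and Sugars broken out, and Protein in grams. Note in particular that sugars count toward carbohydrates, and saturated and trans fats count toward total fat; this will matter when we look at correlations between these columns later.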
After this breakdown, do you understand the nutrition facts of the food we consume daily a little better? Fast food in particular tends to be high in sodium, sugar, saturated fat, trans fat, calories, and processed preservatives that are bad for our health. Therefore, to maintain a healthy lifestyle, it is important to limit our consumption of fast food. (I have to admit it is YUMMY though!) Without further ado, let’s dive into the menus from these companies and find out how different they really are!
The data was taken from the Kaggle dataset Fast Food Nutrition, collected by Joakim Arvidsson.
The dataset contains a fair number of blank cells representing missing values, yet for some reason R could not detect them when reading the file directly. I therefore downloaded the dataset, manually replaced those blanks with -99, and then converted the -99 values to NA when reading the file.
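(As a side note, a minimal base-R sketch of an alternative that skips the manual edit, assuming the blanks are empty or whitespace-only cells, could look like this.)
# Hypothetical alternative: read everything as character, then turn
# empty or whitespace-only cells into NA, without editing the file by hand.
raw <- read.csv("FastFoodNutritionMenuV2.csv", colClasses = "character")
raw[] <- lapply(raw, function(x) ifelse(trimws(x) == "", NA, x))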
Before training our model, we should understand the dataset and handle any uncommon data to avoid unexpected behavior of the model. Let’s load our dataset first!
# Load the libraries used throughout this project.
library(tidyverse)   # dplyr, ggplot2, readr, ...
library(tidymodels)  # recipes, rsample, tune, yardstick, ...
library(tidytext); library(corrplot); library(hrbrthemes)
library(patchwork); library(janitor); library(DALEXtra)

fastfood <- read.csv("FastFoodNutritionMenuV2.csv",
                     na.strings = c("", " ", "-99"),
                     colClasses = rep("character", 14)) %>%
  mutate(Company = factor(Company, levels = c("McDonald’s", "Burger King", "Wendy’s",
                                              "KFC", "Taco Bell", "Pizza Hut")))
colnames(fastfood) <- c("Company", "Item", "Calories", "Calories_from_Fat",
"Total_Fat_g", "Saturated_Fat_g", "Trans_Fat_g",
"Cholesterol_mg", "Sodium_mg", "Carbs_g",
"Fiber_g", "Sugar_g", "Protein_g",
"Weight_Watchers_pnts")
str(fastfood)
## 'data.frame': 1144 obs. of 14 variables:
## $ Company : Factor w/ 6 levels "McDonald’s","Burger King",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Item : chr "Hamburger" "Cheeseburger" "Double Cheeseburger" "McDouble" ...
## $ Calories : chr "250" "300" "440" "390" ...
## $ Calories_from_Fat : chr "80" "110" "210" "170" ...
## $ Total_Fat_g : chr "9" "12" "23" "19" ...
## $ Saturated_Fat_g : chr "3.5" "6" "11" "8" ...
## $ Trans_Fat_g : chr "0.5" "0.5" "1.5" "1" ...
## $ Cholesterol_mg : chr "25" "40" "80" "65" ...
## $ Sodium_mg : chr "520" "750" "1150" "920" ...
## $ Carbs_g : chr "31" "33" "34" "33" ...
## $ Fiber_g : chr "2" "2" "2" "2" ...
## $ Sugar_g : chr "6" "6" "7" "7" ...
## $ Protein_g : chr "12" "15" "25" "22" ...
## $ Weight_Watchers_pnts: chr "247.5" "297" "433" "383" ...
Currently, there are 1144 rows and 14 columns in the dataset, which means there are 1144 different food items and 13 potential predictors. One of the variables, Company, is our response. Item doesn’t provide any useful information for predicting Company, so we will not use it, which leaves us with 12 predictors: Calories, Calories from Fat, Total Fat (g), Saturated Fat (g), Trans Fat (g), Cholesterol (mg), Sodium (mg), Carbs (g), Fiber (g), Sugars (g), Protein (g), and Weight Watchers (Points). In addition, all of the nutrition columns have type chr, so we need to convert them to numeric for the models to process the data properly.
Let’s take a closer look at each predictor and make sure any uncommon values are handled properly.
fastfood$Calories %>% factor() %>% levels()
fastfood$Calories_from_Fat %>% factor() %>% levels()
fastfood$Total_Fat_g %>% factor() %>% levels()
fastfood$Saturated_Fat_g %>% factor() %>% levels()
fastfood$Trans_Fat_g %>% factor() %>% levels()
fastfood$Cholesterol_mg %>% factor() %>% levels()
fastfood$Sodium_mg %>% factor() %>% levels()
fastfood$Carbs_g %>% factor() %>% levels()
fastfood$Fiber_g %>% factor() %>% levels()
fastfood$Sugar_g %>% factor() %>% levels()
fastfood$Protein_g %>% factor() %>% levels()
fastfood$Weight_Watchers_pnts %>% factor() %>% levels()
fastfood %>%
filter_all(any_vars(. %in% c("5.5 g"))) %>%
nrow
## [1] 1
fastfood %>%
filter_all(any_vars(. %in% c("<5"))) %>%
nrow
## [1] 14
fastfood %>%
filter_all(any_vars(. %in% c("<1"))) %>%
nrow
## [1] 31
As we can see, there is some data cleaning to do: one 5.5 g value in Saturated_Fat_g, fourteen <5 values, and thirty-one <1 values. These values would be lost when we convert the columns from character to numeric. I chose to replace 5.5 g with 5.5, <5 with 2.5, and <1 with 0.5. After the replacement, I can convert the column types to numeric.
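To see the problem concretely, coercing these strings directly turns them into NA:
# Direct coercion silently loses the non-numeric entries.
as.numeric(c("250", "<5", "5.5 g"))
## Warning: NAs introduced by coercion
## [1] 250  NA  NA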
fastfood[fastfood=="<1"] <- "0.5"
fastfood[fastfood=="<5"] <- "2.5"
fastfood[fastfood=="5.5 g"] <- "5.5"
fastfood <- fastfood %>%
  mutate_at(c('Calories', 'Calories_from_Fat', 'Total_Fat_g', 'Saturated_Fat_g',
              'Trans_Fat_g', 'Cholesterol_mg', 'Sodium_mg', 'Carbs_g', 'Fiber_g',
              'Sugar_g', 'Protein_g', 'Weight_Watchers_pnts'), as.numeric)
str(fastfood)
## 'data.frame': 1144 obs. of 14 variables:
## $ Company : Factor w/ 6 levels "McDonald’s","Burger King",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Item : chr "Hamburger" "Cheeseburger" "Double Cheeseburger" "McDouble" ...
## $ Calories : num 250 300 440 390 510 740 540 460 510 790 ...
## $ Calories_from_Fat : num 80 110 210 170 230 380 260 220 250 350 ...
## $ Total_Fat_g : num 9 12 23 19 26 42 29 24 28 39 ...
## $ Saturated_Fat_g : num 3.5 6 11 8 12 19 10 8 11 17 ...
## $ Trans_Fat_g : num 0.5 0.5 1.5 1 1.5 2.5 1.5 1.5 1.5 2 ...
## $ Cholesterol_mg : num 25 40 80 65 90 155 75 70 85 145 ...
## $ Sodium_mg : num 520 750 1150 920 1190 1380 1040 720 960 2070 ...
## $ Carbs_g : num 31 33 34 33 40 40 45 37 38 63 ...
## $ Fiber_g : num 2 2 2 2 3 3 3 3 3 4 ...
## $ Sugar_g : num 6 6 7 7 9 9 9 8 8 13 ...
## $ Protein_g : num 12 15 25 22 29 48 25 24 27 45 ...
## $ Weight_Watchers_pnts: num 248 297 433 383 502 ...
Besides uncommon values, we should also handle missing data. Let’s check how much data is missing in the dataset and decide whether to remove those predictors or impute their values.
library(naniar)
vis_miss(fastfood)
# fastfood %>% summary()
There is a significant amount of missing data in the Calories from Fat (516) and Weight Watchers (270) predictors. While we could potentially impute these variables using the other predictors, they do not necessarily provide additional information that would assist model prediction, so we should omit these 2 columns. In addition, as mentioned above, Item doesn’t provide any useful information for predicting Company, so I will remove it as well.
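(For completeness, had we preferred imputation over dropping, a minimal recipes-based sketch, assuming we impute Calories from Fat linearly from Calories and Total Fat, might look like this; it is not used in this project.)
# Sketch only: linearly impute Calories_from_Fat from related predictors.
recipe(Company ~ ., data = fastfood) %>%
  step_impute_linear(Calories_from_Fat,
                     impute_with = imp_vars(Calories, Total_Fat_g))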
fastfood <- fastfood[, -4]   # drop Calories_from_Fat (column 4)
fastfood <- fastfood[, -13]  # drop Weight_Watchers_pnts (column 13 after the first drop)
fastfood %>% summary()
## Company Item Calories Total_Fat_g
## McDonald’s :327 Length:1144 Min. : 0.0 Min. : 0.0
## Burger King:190 Class :character 1st Qu.: 140.0 1st Qu.: 0.0
## Wendy’s :154 Mode :character Median : 240.0 Median : 8.0
## KFC :218 Mean : 287.9 Mean :11.7
## Taco Bell :181 3rd Qu.: 390.0 3rd Qu.:18.0
## Pizza Hut : 74 Max. :1220.0 Max. :98.0
## NA's :13 NA's :67
## Saturated_Fat_g Trans_Fat_g Cholesterol_mg Sodium_mg
## Min. : 0.000 Min. :0.0000 Min. : 0.00 Min. : 0.0
## 1st Qu.: 0.000 1st Qu.:0.0000 1st Qu.: 0.00 1st Qu.: 70.0
## Median : 3.000 Median :0.0000 Median : 20.00 Median : 190.0
## Mean : 4.076 Mean :0.1411 Mean : 40.25 Mean : 428.1
## 3rd Qu.: 6.000 3rd Qu.:0.0000 3rd Qu.: 50.00 3rd Qu.: 680.0
## Max. :33.000 Max. :4.5000 Max. :575.00 Max. :2890.0
## NA's :67 NA's :67 NA's :13 NA's :13
## Carbs_g Fiber_g Sugar_g Protein_g
## Min. : 0.00 Min. : 0.000 Min. : 0.0 Min. : 0.000
## 1st Qu.: 17.00 1st Qu.: 0.000 1st Qu.: 2.0 1st Qu.: 0.000
## Median : 34.00 Median : 0.000 Median : 8.0 Median : 7.000
## Mean : 38.98 Mean : 1.444 Mean : 23.8 Mean : 9.452
## 3rd Qu.: 52.00 3rd Qu.: 2.000 3rd Qu.: 39.0 3rd Qu.:14.000
## Max. :270.00 Max. :31.000 Max. :264.0 Max. :71.000
## NA's :67 NA's :67 NA's :13 NA's :67
Even though there are still missing data in the dataset, taking a closer look at these rows, I realized that they mostly describe drinks, which makes sense: drinks naturally lack information such as total fat. These rows probably do not matter much for our goal of predicting companies, since they are common drinks that are not unique to any company. Therefore, we can remove them from the dataset. Only 2 of them, Soft Taco Supreme from Taco Bell and Salad Dressings from McDonald’s, are not drinks, and removing them shouldn’t affect our task too much.
# fastfood[!complete.cases(fastfood), ]
fastfood <- fastfood[complete.cases(fastfood), ]
fastfood <- fastfood[, -2]
fastfood %>% summary()
## Company Calories Total_Fat_g Saturated_Fat_g
## McDonald’s :326 Min. : 0.0 Min. : 0.0 Min. : 0.000
## Burger King:179 1st Qu.: 150.0 1st Qu.: 0.0 1st Qu.: 0.000
## Wendy’s :154 Median : 250.0 Median : 8.0 Median : 3.000
## KFC :218 Mean : 294.5 Mean :11.7 Mean : 4.076
## Taco Bell :126 3rd Qu.: 400.0 3rd Qu.:18.0 3rd Qu.: 6.000
## Pizza Hut : 74 Max. :1220.0 Max. :98.0 Max. :33.000
## Trans_Fat_g Cholesterol_mg Sodium_mg Carbs_g
## Min. :0.0000 Min. : 0.00 Min. : 0.0 Min. : 0.00
## 1st Qu.:0.0000 1st Qu.: 0.00 1st Qu.: 80.0 1st Qu.: 17.00
## Median :0.0000 Median : 15.00 Median : 210.0 Median : 34.00
## Mean :0.1411 Mean : 36.98 Mean : 447.5 Mean : 38.98
## 3rd Qu.:0.0000 3rd Qu.: 40.00 3rd Qu.: 710.0 3rd Qu.: 52.00
## Max. :4.5000 Max. :575.00 Max. :2890.0 Max. :270.00
## Fiber_g Sugar_g Protein_g
## Min. : 0.000 Min. : 0.00 Min. : 0.000
## 1st Qu.: 0.000 1st Qu.: 2.00 1st Qu.: 0.000
## Median : 0.000 Median : 7.00 Median : 7.000
## Mean : 1.444 Mean : 22.94 Mean : 9.452
## 3rd Qu.: 2.000 3rd Qu.: 36.00 3rd Qu.:14.000
## Max. :31.000 Max. :264.00 Max. :71.000
Now that all the missing values are handled, let’s explore the relationships in the data. The first simple thing we might want to see is the distribution of menu items across companies, i.e., how many items each company contributes to the dataset.
ggplot(fastfood, aes(x=Company)) + geom_bar(stat="count", fill="#69b3a2")+
ggtitle("Distribution of Menu Items from Fastfood Companies") +
xlab("Company") +
ylab("Count")
As we can see, the dataset is imbalanced: McDonald’s has the most menu items and Pizza Hut the fewest, which encourages us to adopt a resampling method to account for this; I will use stratified k-fold cross-validation. The distribution also suggests that McDonald’s offers the most diverse menu to its customers.
It would also be useful to look at the distribution of calories across companies.
fastfood %>%
ggplot( aes(x=Calories, fill=Company) ) +
geom_histogram( color="#e9ecef", alpha=0.6, position = 'identity') +
scale_fill_manual(values=c("#69b3a2","#404080","#FFFFCC","#CCE5FF","#FFCCE5", "red")) +
theme_ipsum() +
labs(fill="") +
facet_wrap(~Company)
df <- fastfood %>%
arrange(desc(Company)) %>%
group_by(Company) %>%
summarise(total=sum(Calories)) %>%
mutate(prop=total/sum(total)) %>%
mutate(label=scales::percent(prop))
df %>%
ggplot(aes(x="", y=prop, fill=Company)) +
geom_bar(stat="identity", width=1, color="white", alpha=0.6) +
coord_polar("y", start=0) +
theme_void() +
geom_text(aes(label = label), color = "black", size=4, position = position_stack(vjust = 0.5))+
scale_fill_manual(values=c("#69b3a2","#404080","#EE45AD","#CCE5FF","#FFCCE5", "red"))
From both the histograms and the pie chart, we can see that McDonald’s contributes the most to the total calories across all companies, which could be because McDonald’s menu items are generally higher in calories, or simply because it has the most menu items, as we saw above. All companies follow a similar distribution that peaks around 400 calories and ranges from 0 all the way to 1200 calories. This high upper bound indicates that some fast food menu items are very high in calories and should be consumed with caution.
We should also check beforehand whether the predictors have any strong correlations so that we can handle them when creating the recipes.
fastfood %>%
  select(where(is.numeric)) %>%
  cor() %>%
  corrplot(type = "lower", diag = FALSE, method = "circle", addCoef.col = 1, number.cex = 0.7)
As expected, Calories is highly correlated with Total Fat, Saturated Fat, Sodium, and Protein. Also, Sugar is highly correlated with Carbs, which is reasonable since sugar is a carbohydrate, as explained in the introduction section. I will include step_pca() to handle these two groups of collinear predictors. To decide how many components to keep, let’s first look at which predictors contribute to each component, and then at the proportion of variance explained.
recipe(Company ~ . , data = fastfood) %>%
step_normalize(all_predictors()) %>%
step_pca(all_predictors()) %>%
prep() %>%
tidy(num=2) %>%
filter(component %in% paste0("PC", 1:4)) %>%
group_by(component) %>%
top_n(8, abs(value)) %>%
ungroup() %>%
mutate(terms = reorder_within(terms, abs(value), component)) %>%
ggplot(aes(abs(value), terms, fill = value > 0)) +
geom_col() +
facet_wrap(~component, scales = "free_y") +
scale_y_reordered() +
labs(
x = "Absolute value of contribution",
y = NULL, fill = "Positive?"
)
We can see that PC1 is driven by Total Fat, Calories, Saturated Fat, Protein, and Sodium, while PC2 is driven by Sugar and Carbs. These results correspond to what we saw in the correlation matrix. Let’s see how much of the variance would be explained by reducing each of these predictor sets to 1 or 2 principal components.
var1 <- recipe(Company ~ . , data = fastfood) %>%
step_normalize(all_predictors()) %>%
step_pca(Total_Fat_g, Saturated_Fat_g, Calories, Sodium_mg, Protein_g) %>%
prep() %>%
tidy(num=2, type = "variance") %>%
filter(component %in% paste0(1:4)) %>%
filter(terms %in% c("cumulative percent variance", "percent variance")) %>%
ggplot(aes(x=component, y=value, fill=factor(component))) +
geom_bar(stat="identity") +
facet_wrap(~terms, scales = "fixed") +
labs(y = "Percent Variance",
title="Percentage of Explained Variances\nTotal_Fat_g + Saturated_Fat_g + Calories + Sodium_mg + Protein_g",
fill="PC Component") +
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank())
var2 <- recipe(Company ~ . , data = fastfood) %>%
step_normalize(all_predictors()) %>%
step_pca(Sugar_g, Carbs_g) %>%
prep() %>%
tidy(num=2, type = "variance") %>%
filter(component %in% paste0(1:4)) %>%
filter(terms %in% c("cumulative percent variance", "percent variance")) %>%
ggplot(aes(x=component, y=value, fill=factor(component))) +
geom_bar(stat="identity") +
facet_wrap(~terms, scales = "fixed") +
labs(x = "PCA Components", y = "Percent Variance",
title="Sugar_g + Carbs_g",
fill="PC Component") +
theme(
axis.text.x=element_blank(),
axis.ticks.x=element_blank())
var1+var2+plot_layout(nrow = 2)
From the plot, we can see that by applying step_pca() to Total_Fat_g, Saturated_Fat_g, Calories, Sodium_mg, and Protein_g, PC1 explains over 75% of the variance, and similarly for Sugar_g and Carbs_g. Therefore, when I create the recipe later, I will apply two step_pca() steps to the two sets of correlated predictors to handle the collinearity.
One counterintuitive relationship in the correlation matrix is that while sodium is strongly positively correlated with total fat, sugar is weakly negatively correlated with it. I had assumed that a higher amount of sugar would also mean higher fat. Let’s take a closer look at the relationship between Total Fat and Sugar.
fastfood %>%
ggplot(aes(x=Sugar_g, y=Total_Fat_g)) +
geom_jitter(width = 0.5, size = 1) +
geom_smooth(method = "lm", se =F, col="red") +
labs(title = "Total Fat (g) vs. Sugar (g)")
From the graph, there seem to be three relationships going on. One group of items has Total Fat increasing regardless of Sugar; a second group has Total Fat increasing as Sugar increases; and a third group has Total Fat staying flat as Sugar increases. After inspecting the data, I realized the third group comes mainly from items marked as nonfat, and the first group comes from high-fat items such as pizza or fried chicken. The overall correlation is dominated by the size of the third group, which leads to the negative relationship we saw earlier. To see the relationships more clearly, let’s draw the three groups separately.
df <- fastfood %>%
select(Total_Fat_g, Sugar_g) %>%
mutate(Group = case_when(Sugar_g<15 ~ 1,
Total_Fat_g<3 ~ 2,
TRUE ~ 3)) %>%
mutate(Group = factor(Group))
df %>%
ggplot(aes(x=Sugar_g, y=Total_Fat_g, color=Group)) +
geom_point(alpha=0.6) +
labs(title = "Total Fat (g) vs. Sugar (g)") +
geom_smooth(method = "lm", se =F, alpha=0.6) +
scale_color_manual(values=c("#69b3a2","deepskyblue","grey"),
labels=c("High Fat Low Calories",
"Low Fat High Sugar",
"Other"))
Now that we have a better idea of what our data looks like, we can finally build and train our models!
The first thing to do is to split our data into training and testing sets. On the training data, we should also create folds for cross-validation, stratified by Company to deal with the imbalanced dataset. Also, remember to set the seed so that other people can reproduce our results! At the end, we verify that the training and testing data were split according to the 0.7 proportion we specified.
set.seed(1105)
fastfood_split <- initial_split(fastfood, prop = 0.7, strata = Company)
fastfood_train <- training(fastfood_split)
fastfood_test <- testing(fastfood_split)
fastfood_folds <- vfold_cv(fastfood_train, v = 7, strata=Company)
c(nrow(fastfood_train)/nrow(fastfood), nrow(fastfood_test)/nrow(fastfood))
## [1] 0.6982358 0.3017642
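Since we also stratified by Company, the class mix should be similar across the two sets; a quick sanity check:
# Class proportions should roughly match between training and testing sets.
prop.table(table(fastfood_train$Company))
prop.table(table(fastfood_test$Company))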
Next, we should create the recipe for our models. At this point, there are 10 predictors: Calories, Total Fat (g), Saturated Fat (g), Trans Fat (g), Cholesterol (mg), Sodium (mg), Carbs (g), Fiber (g), Sugars (g), and Protein (g). As mentioned above, after normalizing the predictors we apply two step_pca() steps to the two sets of correlated predictors: Total Fat, Calories, Saturated Fat, Protein, and Sodium in one set, and Sugar and Carbs in the other. In the end, there are 6 predictors left in the recipe, as shown below.
fastfood_recipe <- recipe(Company ~ . , data = fastfood_train) %>%
step_dummy(all_nominal_predictors()) %>%
step_normalize(all_predictors()) %>%
step_pca(Total_Fat_g, Saturated_Fat_g, Calories, Sodium_mg, Protein_g,
num_comp = 2, prefix = "first_pc") %>%
step_pca(Sugar_g, Carbs_g,
num_comp = 1, prefix = "sec_pc")
prep(fastfood_recipe) %>% bake(new_data = fastfood_train) %>% head()
## # A tibble: 6 Ă— 7
## Trans_Fat_g Cholesterol_mg Fiber_g Company first_pc1 first_pc2 sec_pc1
## <dbl> <dbl> <dbl> <fct> <dbl> <dbl> <dbl>
## 1 2.69 0.799 0.223 Burger King -3.60 -0.433 0.00531
## 2 3.69 1.18 0.223 Burger King -4.79 -0.473 0.0278
## 3 3.69 1.33 0.223 Burger King -5.46 -0.320 0.0278
## 4 5.67 2.08 0.223 Burger King -6.26 -1.19 0.00531
## 5 5.67 2.38 0.223 Burger King -7.46 -1.23 0.0278
## 6 7.65 3.28 0.223 Burger King -8.84 -1.95 0.00531
Finally, we are ready to build our models! I will be training 5 different models: Logistic Regression, K-Nearest Neighbors, Elastic Net, Random Forest, and Boosted Tree. Different hyperparameters will be tuned for each model to produce the best performance. Given the number of models and folds for cross-validation, training takes a long time, so the results are all saved to RDA files, which can be loaded without retraining the models.
log_reg <- multinom_reg(penalty = 0) %>%
set_engine("glmnet") %>%
set_mode("classification")
log_wkflow <- workflow() %>%
add_model(log_reg) %>%
add_recipe(fastfood_recipe)
knn <- nearest_neighbor(neighbors = tune()) %>%
set_mode("classification") %>%
set_engine("kknn")
knn_wkflow <- workflow() %>%
add_model(knn) %>%
add_recipe(fastfood_recipe)
en <- multinom_reg(penalty = tune(), mixture = tune()) %>%
set_engine("glmnet") %>%
set_mode("classification")
en_wkflow <- workflow() %>%
add_recipe(fastfood_recipe) %>%
add_model(en)
rf <- rand_forest(mtry = tune(),
trees = tune(),
min_n = tune()) %>%
set_engine("ranger") %>%
set_mode("classification")
rf_wkflow <- workflow() %>%
add_model(rf) %>%
add_recipe(fastfood_recipe)
bt <- boost_tree(mtry = tune(),
trees = tune(),
learn_rate = tune()) %>%
set_engine("xgboost") %>%
set_mode("classification")
bt_wkflow <- workflow() %>%
add_model(bt) %>%
add_recipe(fastfood_recipe)
knn_grid <- grid_regular(neighbors(range=c(1, 10)), levels = 10)
en_grid <- grid_regular(penalty(), mixture(range=c(0, 1)), levels=10)
rf_grid <- grid_regular(mtry(range = c(1, 6)),
trees(range = c(200, 600)),
min_n(range = c(10, 20)),
levels = 5)
bt_grid <- grid_regular(mtry(range = c(1, 6)),
trees(range = c(200, 600)),
learn_rate(range = c(-10, -1)),
levels = 5)
log_fit <- log_wkflow %>%
fit_resamples(fastfood_folds)
knn_tune <- tune_grid(
object = knn_wkflow,
resamples = fastfood_folds,
grid = knn_grid
)
en_tune <- tune_grid(
object = en_wkflow,
resamples = fastfood_folds,
grid = en_grid,
control = control_grid(verbose = TRUE)
)
rf_tune <- tune_grid(
rf_wkflow,
resamples = fastfood_folds,
grid = rf_grid,
control = control_grid(verbose = TRUE)
)
bt_tune <- tune_grid(
bt_wkflow,
resamples = fastfood_folds,
grid = bt_grid,
control = control_grid(verbose = TRUE)
)
save(log_fit, file = "log_fit.rda")
save(knn_tune, file = "knn_tune.rda")
save(en_tune, file = "en_tune.rda")
save(rf_tune, file = "rf_tune.rda")
save(bt_tune, file = "bt_tune.rda")
Let’s take a look at the model performance!
load("log_fit.rda")
load("knn_tune.rda")
load("en_tune.rda")
load("rf_tune.rda")
load("bt_tune.rda")
log_auc <- collect_metrics(log_fit) %>%
filter(.metric=="roc_auc") %>%
select(mean) %>%
mutate(Model = "Logistic Regression", .before=mean)
knn_auc <- show_best(knn_tune, metric = "roc_auc")%>%
filter(.metric=="roc_auc", row_number()==1) %>%
select(mean) %>%
mutate(Model = "KNN", .before=mean)
en_auc <- show_best(en_tune, metric = "roc_auc")%>%
filter(.metric=="roc_auc", row_number()==1) %>%
select(mean) %>%
mutate(Model = "Elastic Net", .before=mean)
rf_auc <- show_best(rf_tune, metric = "roc_auc")%>%
filter(.metric=="roc_auc", row_number()==1) %>%
select(mean) %>%
mutate(Model = "Random Forest", .before=mean)
bt_auc <- show_best(bt_tune, metric = "roc_auc")%>%
filter(.metric=="roc_auc", row_number()==1) %>%
select(mean) %>%
mutate(Model = "Boosted Tree", .before=mean)
bind_rows(list(log_auc, knn_auc, en_auc, rf_auc, bt_auc)) %>%
arrange(desc(mean)) %>%
rename(ROC_AUC=mean)
## # A tibble: 5 Ă— 2
## Model ROC_AUC
## <chr> <dbl>
## 1 Boosted Tree 0.876
## 2 Random Forest 0.872
## 3 KNN 0.845
## 4 Elastic Net 0.709
## 5 Logistic Regression 0.708
From the table, we can see that Boosted Tree, Random Forest, and KNN perform the best in terms of ROC_AUC. Let’s take a closer look at their individual results through plots.
bt_tune %>% autoplot()
Interestingly enough, the number of trees doesn’t affect the performance of the Boosted Tree. The best performance (ROC_AUC = 0.876) is achieved with a learning rate of 0.1 and 1 randomly selected predictor at each split when building the trees.
rf_tune %>% autoplot()
Similarly, for the Random Forest, the number of trees doesn’t affect performance much. The best performance (ROC_AUC = 0.872) is achieved with a minimal node size of 10 and 2 randomly selected predictors at each split when building the trees.
knn_tune %>% autoplot()
Lastly, the performance of the KNN model improves as the number of nearest neighbors increases. The best performance (ROC_AUC = 0.845) is achieved with 10 nearest neighbors. Presumably, KNN might perform even better if we allowed more neighbors in the tuning grid; keep in mind, though, that pushing the number of neighbors too high oversmooths the decision boundary and the model starts to underfit, so we will stop at 10 neighbors here.
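(If we ever wanted to test that hunch, a minimal sketch that widens the tuning grid, reusing the same workflow and folds from above, could look like this.)
# Sketch: tune over a wider neighbor range to see where performance plateaus.
knn_grid_wide <- grid_regular(neighbors(range = c(1, 30)), levels = 15)
knn_tune_wide <- tune_grid(knn_wkflow, resamples = fastfood_folds, grid = knn_grid_wide)
show_best(knn_tune_wide, metric = "roc_auc")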
Let’s prepare our top 2 models, Boosted Tree and Random Forest, for testing to see how they perform on unseen data. We will select the best-performing hyperparameters for each model, refit on the full training set, and evaluate on the testing data.
best_bt <- select_best(bt_tune, metric = "roc_auc")
bt_final_fit <- finalize_workflow(bt_wkflow, best_bt) %>%
fit(fastfood_train)
best_rf <- select_best(rf_tune, metric = "roc_auc")
rf_final_fit <- finalize_workflow(rf_wkflow, best_rf) %>%
fit(fastfood_train)
bt_test <- augment(bt_final_fit, fastfood_test) %>%
select(Company, starts_with(".pred")) %>%
clean_names() %>%
roc_auc(company, pred_mc_donald_s:pred_pizza_hut)
rf_test <- augment(rf_final_fit, fastfood_test) %>%
select(Company, starts_with(".pred")) %>%
clean_names() %>%
roc_auc(company, pred_mc_donald_s:pred_pizza_hut)
bind_rows(list(bt_test, rf_test)) %>%
select(.estimate) %>%
mutate(Model = c("Boosted Tree", "Random Forest"), .before=.estimate) %>%
arrange(desc(.estimate)) %>%
rename(ROC_AUC=.estimate)
## # A tibble: 2 Ă— 2
## Model ROC_AUC
## <chr> <dbl>
## 1 Boosted Tree 0.888
## 2 Random Forest 0.876
Yay!! Our Boosted Tree and Random Forest models both perform quite well even on the testing data, with ROC_AUC = 0.888 and ROC_AUC = 0.876 respectively. Impressive! This is not what I expected, since I assumed all fast food companies would have similarly high-calorie menus that would not be easy to distinguish from each other. Can you guess which company is the easiest for the Boosted Tree to predict? The answer is….
augment(bt_final_fit, fastfood_test) %>%
select(Company, starts_with(".pred")) %>%
clean_names() %>%
roc_curve(company, pred_mc_donald_s:pred_pizza_hut) %>%
autoplot()
augment(bt_final_fit, fastfood_test) %>%
select(Company, starts_with(".pred")) %>%
conf_mat(Company, .pred_class) %>%
autoplot(type = "heatmap")
Pizza Hut?! The company with the fewest menu items in our training data?! From the confusion matrix, we can see that out of all 15 Pizza Hut menu items in the testing data, the Boosted Tree predicted 13 of them correctly. Pizza Hut also has the largest ROC_AUC. Out of curiosity, I found a Pizza Hut menu item online that is not included in our dataset. Let’s see if our model can predict it correctly.
hot_honey_double_pepperoni <- data.frame(
Company = "Pizza Hut",
Calories = 360,
Total_Fat_g = 17,
Saturated_Fat_g = 7,
Trans_Fat_g = 0,
Cholesterol_mg = 45,
Sodium_mg = 780,
Carbs_g = 38,
Fiber_g = 3,
Sugar_g = 6,
Protein_g = 15
)
predict(bt_final_fit, hot_honey_double_pepperoni, type = "class")
## # A tibble: 1 Ă— 1
## .pred_class
## <fct>
## 1 Pizza Hut
Amazing! The Boosted Tree model indeed predicts that this new menu item comes from Pizza Hut. Let’s probe into the model a bit and see if we can figure out why it predicts this way.
features <- c('Calories', 'Total_Fat_g', 'Saturated_Fat_g', 'Trans_Fat_g',
'Cholesterol_mg', 'Sodium_mg', 'Carbs_g', 'Fiber_g', 'Sugar_g',
'Protein_g')
vip_train <- fastfood_train %>% select(all_of(features))
explainer_bt <- explain_tidymodels(
bt_final_fit,
data=vip_train,
y=fastfood_train$Company,
label="Boosted Tree",
verbose=FALSE
)
predict_parts(explainer = explainer_bt, new_observation = hot_honey_double_pepperoni) %>% plot()
From the plot, we can see that Total Fat = 17 contributes the most to the model’s prediction of Pizza Hut as the company this new menu item comes from. I am not entirely sure why, since, given the complexity of Boosted Trees, these models are usually less explainable. One thing is for sure, though: the models are trained well enough to classify menu items from different companies.
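To peek a bit further, one hedged option is global permutation importance via DALEX’s model_parts(), reusing the explainer we built above; it estimates how much shuffling each predictor hurts the model overall.
# Sketch: permutation-based global variable importance for the Boosted Tree.
explainer_bt %>%
  model_parts() %>%
  plot()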
In this project, we explored, analyzed, and trained models on the nutrition facts from different fast food companies’ menus, trying to see whether these companies differ enough in nutrition facts to be told apart. To my surprise, our models classify menu items to the correct company really well, even though I thought all fast food would be generally the same. Out of all the models, the Boosted Tree performed the best, with ROC_AUC = 0.888 on the testing data. From these results, we can probably say that each company has its own unique recipes that set it apart. Pizza Hut especially might have some secret recipe that distinguishes it from the other fast food companies. Unfortunately, given the complex behavior of our models, it is hard to probe into why and how the models predict this way. In the future, it would be helpful to dig deeper into model behavior and interpret the reasons behind model decisions.
Arvidsson, Joakim. Fast Food Nutrition. Kaggle, 2023. Accessed March 18, 2024. https://www.kaggle.com/datasets/joebeachcapital/fast-food/code